

General guideline for the demos:

These videos demonstrate semi-online segmentation with a delay of 4 seconds on streaming videos. Frame-level predictions are made every 1 second at inference time. The proposed method was trained on weak labels: at training time it only had access to each video's ordered sequence of actions (the transcript), not the actions' start/end timestamps.
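As an illustration of this semi-online setup, the sketch below buffers a video stream and emits frame-level labels once per second, always lagging the live stream by the 4-second delay. All names here (the model call, the fps value) are illustrative assumptions, not the authors' implementation:

```python
DELAY_SEC = 4  # segmentation output lags the stream by 4 seconds
STEP_SEC = 1   # a new frame-level prediction is emitted every second

def semi_online_segmentation(stream, model, fps=30):
    """Return (time, labels) pairs: at second t, frame-level labels
    for all frames up to t - DELAY_SEC, refreshed every STEP_SEC."""
    buffer = []
    outputs = []
    for t, frame in enumerate(stream):  # one frame per 1/fps seconds
        buffer.append(frame)
        current_sec = (t + 1) / fps
        if (t + 1) % (STEP_SEC * fps) == 0 and current_sec > DELAY_SEC:
            # the model may use the 4-second look-ahead window when
            # labeling frames up to (current_sec - DELAY_SEC)
            horizon = int((current_sec - DELAY_SEC) * fps)
            labels = model(buffer, up_to=horizon)
            outputs.append((current_sec, labels[:horizon]))
    return outputs
```

The point of the delay is that every emitted label can be revised in light of 4 seconds of "future" frames before it is committed.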
At the bottom of each demo you can observe the color-coded temporal segmentation of the video, where each color corresponds to an action class. Three segmentation results are shown: the ground truth (GT), the predictions of the proposed method, and the results of constrained Viterbi [A]. The previous method strictly follows the action sequences seen at training time, failing to adapt to sequential variations and anomalies. In our proposed method, we define criteria under which deviations from the transcripts are allowed, so we are able to recognize some out-of-sequence actions.
On the right side of the screen, detected errors are printed. After segmenting the video into atomic actions, the algorithm applies pre-defined rules to detect errors. Each error is one of the following pre-defined labels:
1) Intermediate idle time, where the subject delays performing an action (background)
2) Dropping an item without subsequently picking it up
3) Not using all 4 legs when assembling the table
4) Not fastening all 4 legs of the table (loose legs)
5) Not inserting both of the 2 screws when assembling the record player in Task 3
6) Using excessive screws when assembling the record player in Task 3
7) Not fastening all screws when assembling the record player in Task 3 (loose screw)
8) Missing the ring in either Task 1 or Task 3
9) Using an excessive ring when assembling the record player in Task 3
10) Not using all parts when assembling the airplane in Task 1
11) Not balancing the part when performing Task 3
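To make the rule-based stage concrete, here is a minimal sketch of how checks like rules 1 and 2 could run over the predicted segment sequence. The segment format, label names, and thresholds are assumptions for illustration, not the authors' exact rules:

```python
def detect_errors(segments):
    """segments: list of (action_label, start_sec, end_sec) tuples,
    in temporal order. Returns a list of (error_name, start, end)."""
    errors = []
    n = len(segments)
    for i, (label, start, end) in enumerate(segments):
        # Rule 1: intermediate idle time -- a background segment that
        # is neither at the very beginning nor the very end of the video.
        if label == "background" and 0 < i < n - 1:
            errors.append(("intermediate_idle", start, end))
        # Rule 2: dropping an item that is never picked up afterwards.
        if label.startswith("drop_"):
            item = label[len("drop_"):]
            later_labels = [s[0] for s in segments[i + 1:]]
            if f"pick_up_{item}" not in later_labels:
                errors.append(("item_not_picked_up", start, end))
    return errors
```

The remaining rules (leg counts, screw counts, rings, balancing) would similarly be expressed as counts or presence checks over the same segment list.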
 


It is important to note that none of these anomalies were observed at training time. For example, no intermediate background/idle segment was seen during training; all training background instances occurred at the beginning or end of the video. Another example can be observed in video P14_2_T3_C4, where the subject inserts an extra screw when assembling the record player (T3). The previous work misses this action because all training samples were performed perfectly, using only 2 screws. In contrast, our proposed method is able to detect such a deviation and recognizes the 3rd insertion.

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[A] Reza Ghoddoosian, Isht Dwivedi, Nakul Agarwal, Chiho Choi, and Behzad Dariush. Weakly-supervised online action segmentation in multi-view instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13780–13790, 2022.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------



####################################################################################################################################
####################################################################################################################################
####################################################################################################################################


Description of a failed case (P11_1_T2_C3):


In general, when assembling a table, one of two common sub-sequences takes place:
a) insert_screw → take_block → spin_block
b) take_block → fasten_screw
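These two sub-sequences can be matched mechanically against a predicted action run. The toy helper below (names are illustrative, not from the paper) shows how the leading action determines which pattern a run is matched to, so a missed insert_screw prevents a leg's run from matching pattern A:

```python
PATTERN_A = ["insert_screw", "take_block", "spin_block"]
PATTERN_B = ["take_block", "fasten_screw"]

def match_pattern(actions):
    """Return 'A', 'B', or None depending on which common
    sub-sequence the predicted actions for one leg follow."""
    if actions[:len(PATTERN_A)] == PATTERN_A:
        return "A"
    if actions[:len(PATTERN_B)] == PATTERN_B:
        return "B"
    return None
```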

One reason for the poor performance in this video is that, when assembling the second leg, the insert_screw action is not well recognized; consequently, the subsequent actions follow pattern B instead of pattern A.
Another source of error is how P11 performs the idle (background) state, floating his hand next to the leg without touching it. Because this type of background is not seen at training time, the algorithm mistakenly predicts take_block and fasten_screw, following pattern B.
While take_block for legs 3 and 4 is correctly recognized, the system is again confused by the background action that follows, and consequently predicts fasten_screw due to occlusion and the subject's hand pose.
The final false prediction of insert_pin is due to an incorrect deviation from the transcripts and a wrong classifier prediction, potentially caused by confusion with the use of the plate.


